ABSTRACT


Feedback on student answers and even during intermediate 
steps in their solutions to open-ended questions is an im- 
portant element in math education. Such feedback can help 
students correct their errors and ultimately lead to improved 
learning outcomes. Most existing approaches for automated 
student solution analysis and feedback require manually con- 
structing cognitive models and anticipating student errors 
for each question. This process requires significant human 
effort and does not scale to most questions used in home- 
works and practices that do not come with this information. 
In this paper, we analyze students’ step-by-step solution pro- 
cesses to equation solving questions in an attempt to scale 
up error diagnostics and feedback mechanisms developed for 
a small number of questions to a much larger number of 
questions. Leveraging a recent math expression encoding 
method, we represent each math operation applied in so- 
lution steps as a transition in the math embedding vector 
space. We use a dataset that contains student solution steps 
in the Cognitive Tutor system to learn implicit and explicit 
representations of math operations. We explore whether 
these representations can i) identify math operations a stu- 
dent intends to perform in each solution step, regardless of 
whether they did it correctly or not, and ii) select the ap- 
propriate feedback type for incorrect steps. Experimental 
results show that our learned math operation representa- 
tions generalize well across different data distributions. 
1. INTRODUCTION


Math education is of crucial importance to a competitive 
future science, technology, engineering, and mathematics 
(STEM) workforce since math knowledge and skills are re- 
quired in many STEM subjects [11]. One important way 


*This work is supported by the National Science Foundation 
under grant IIS-1917713. 


Mengxue Zhang, Zichao Wang, Richard Baraniuk and Andrew Lan “Math 
Operation Embeddings for Open-ended Solution Analysis and Feedback”. 
2021. In: Proceedings of The 14th International Conference on Educational 
Data Mining (EDM21). International Educational Data Mining Society, 
216-227. https://educationaldatamining.org/edm2021/ 

EDM ’21 June 29 - July 02 2021, Paris, France 


to help struggling students improve in math is to diagnose 
errors from student answers to math questions and deliver 
personalized support to help them correct these errors [1]. 
In short-answer questions, feedback of various types [39] can 
be deployed according to the specific incorrect final answers 
students submit, while in open-ended questions, feedback 
can be deployed at intermediate solution steps according to 
the specific actions they take and their outcomes [22]. In 
traditional educational settings, this feedback process relies 
on teachers going over student work, identifying errors, and 
providing feedback [15], which results in a labor-intensive 
process and a slow feedback cycle for students. Such a set- 
ting is even more limited as a result of the COVID-19 pan- 
demic, which introduced new barriers to face-to-face inter- 
actions between teachers and students. 


In intelligent tutoring systems, a more scalable approach to 
math feedback is to automatically deploy feedback based on 
students’ final answers or certain incorrect intermediate so- 
lution steps. For example, in ASSISTments [12], teachers 
can create hints and feedback messages for specific incorrect 
student answers to short-answer questions that they antici- 
pate [28], which the system can automatically deploy when 
students submit these incorrect answers. This crowdsourc- 
ing approach efficiently scales up teachers’ effort so that they 
can benefit a large number of students without putting in 
additional effort. In many other systems such as Cognitive 
Tutor [34] and Algebra Notepad [27], researchers use cogni- 
tive models to anticipate student errors as results of buggy 
production rules or insufficient knowledge on key math con- 
cepts [20, 24]. They then develop corresponding feedback 
for intermediate solution steps in multi-step questions (e.g., 
those on equation solving). This cognitive model-based
approach requires significant effort by domain experts but has
been shown to be highly effective in large-scale studies.


However, these approaches for student feedback are still lim- 
ited in their generalizability to many math questions de- 
ployed in daily homeworks and practices. For the teacher 
crowdsourcing approach, hint and feedback messages have 
to be written for each individual question (or group of ques- 
tions generated from the same template with different nu- 
merical values). For the cognitive model-based approach, a 
rigorous solution process has to be specified for each ques- 
tion with annotations on the math operations that should 
be applied at each solution step. However, questions used 
in many real-world educational settings do not come with 
such information; teachers simply adopt them from sources 
such as textbooks and open education resources and assign
them to students without developing corresponding feedback 
mechanisms. Moreover, past research has shown that a large 
portion of incorrect student answers cannot be anticipated 
by cognitive models [43], teachers/domain experts [8], or nu- 
merical simulations [37]. Therefore, it may be hard for high- 
quality feedback developed for questions used in intelligent 
tutoring systems to generalize to questions in the wild. 


1.1 Contributions 

In this paper, we develop data-driven methods that enable 
us to analyze step-by-step solutions to open-ended math 
questions. In contrast to existing methods that rely on a 
top-down approach, i.e., defining the structure of the so- 
lution process and anticipating student errors, we propose 
a bottom-up approach, i.e., using learned representations of 
math expressions and math operations to predict i) math 
operations in student solution steps and ii) the appropriate 
feedback for incorrect solution steps. We restrict ourselves 
to the specific domain of equation solving where the solu- 
tion process consists of applying specific math operations 
between math expressions in consecutive steps; other sub- 
domains of math such as algebra word problems [45] and 
questions involving graphs and geometry [16] are left as fu- 
ture work. Specifically, our contributions are: 


• First, we characterize math operations by how they
transform math expressions in the math embedding 
space in each solution step. We leverage recent work 
on learning math symbol embeddings from large-scale 
scientific formula data [46] to encode math expressions 
in student solutions: each math expression is mapped 
to a point in the math embedding vector space. We use 
synthetically generated data as well as solution steps 
generated by real students to learn the representation 
of each math operation. We explore several meth- 
ods for learning both implicit and explicit math op- 
eration representations: a classification-based method 
that does not explicitly impose a structure on math op- 
erations, a linear model that assumes each operation is 
characterized by an additive vector in the embedding 
space, and a nonlinear model where math operations 
live in their own, interconnected embedding spaces. 


• Second, we apply these math operation representation
learning methods to a real-world student step-by-step
solution dataset collected while students learn equation
solving in an intelligent tutoring system, Cognitive Tu- 
tor [34]. We validate our math operation representa- 
tion learning methods via two tasks: i) predicting the 
specific math operation the student intended to ap- 
ply in a solution step from the math expressions be- 
fore and after the step and ii) predicting the appro- 
priate feedback deployed to students from the incor- 
rect math expressions they enter. Quantitative results 
show that tree embedding-based math expression en- 
coding methods outperform other encoding methods 
since they are able to explicitly capture the seman- 
tic and structural characteristics of math expressions. 
They also have better generalizability across different 
data distributions and remain effective across different 
question difficulty levels and even when student solu- 
tions steps contain errors. 


[Figure 1 content: a "Solve for x" question, its solution steps,
and the predicted math operations and feedback.
Legible solution expressions: 7x + 9 = 7 − x; 7x + 9 + x = 7;
8x = −2; 8x/8 = −2/8; x = −2/8.
Predicted math operations: 1. COMBINE_ADD (100%);
2. COMBINE_ADD (60%), ADD_SIDE (40%); 3. DIV_SIDE (100%);
4. COMBINE_MUL (92%), COMBINE_ADD (8%).]

Figure 1. Demonstration of the generalizability of our math 
operation representations to other data sources for a solution 
process provided on Algebra.com. Our methods can success- 
fully predict the math operations applied in each step and 
the appropriate feedback type in an incorrect step. 


1.2 Use Case 


Before diving into the technical details, we first illustrate 
a potential use case for our math operation representation 
learning methods and corresponding operation/feedback 
classifiers. Our goal is to transfer expert designs in intel- 
ligent tutoring systems for math education to questions in 
the wild. Specifically, we apply the math operation rep- 
resentations learned from student solution steps and corre- 
sponding labels (step name, feedback message) in the highly 
structured Cognitive Tutor system to environments that are 
not highly structured. Figure 1 shows the solution process 
to an equation solving question on Algebra.com¹ and the
corresponding math operation and feedback predictions at 
each step. We see that our math operation representation 
learning methods can accurately predict the math opera- 
tions applied in solution steps 1, 3, and 4 using the opera- 
tion names provided in the Cognitive Tutor system. Even 
in step 2 where two different math operations are combined 
into a single step, i.e., 


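$$
\begin{array}{c}
7x + 9 = 7 - x \\
\downarrow \ \text{ADD } x \text{ TO BOTH SIDES} \\
7x + 9 + x = 7 - x + x \\
\downarrow \ \text{COMBINE TERMS ON RIGHT SIDE} \\
7x + 9 + x = 7,
\end{array}
$$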


despite only training on steps in Cognitive Tutor that involve 
only one math operation, the classifier is able to recognize 
both of them with high predictive probability for both. We 


¹The original question and the solution process can be found at
https://www.algebra.com/algebra/homework/equations/Equations.faq.question.4872.html.
also change one of the solution steps, i.e., step 5, to make
it incorrect and test our feedback classifier. In this case, 
the classifier is able to recognize the error in this step and 
find the corresponding feedback types in Cognitive Tutor. 
This potential use case demonstrates the utility of our math 
operation representation learning methods: by transferring 
knowledge learned in well-designed, highly-structured sys- 
tems such as Cognitive Tutor, especially on what feedback 
to deploy for each student error, to other domains such as 
online math Q&A sites, we are scaling up the effort domain 
experts put into the design of these feedback mechanisms. 


2. RELATED WORK 


One related body of work in math education studies
student solution processes to identify student strategies and
assess errors. Specifically, [33] uses inverse Bayesian
planning to learn solution strategies (i.e., policies) in equation
solving and capture student misunderstandings in a Markov 
decision process framework. Our work focuses on a differ- 
ent aspect of the solution process: the representation of the 
math expressions at each solution step and the modeling 
of the transitions between different math expressions under 
math operations. [9] uses basic math operations to con- 
struct programs to understand errors that students make in 
their solutions to arithmetic questions. Our work focuses 
on equation solving, which is a more difficult problem in
which student responses are more diverse and less
structured than arithmetic calculations.


Another related body of work focuses on learning
representations of student answers to short-answer questions. [21]
analyzes incorrect student answers across multiple questions,
learns representations of errors, and generalizes misconception
feedback across questions. Our work analyzes the full math
expressions in intermediate solution steps, whereas their work
represents short answers according to the frequency with which
they occur in an answer pool. [8] uses trained word embeddings
to represent short answers for automated grading purposes. 
Our work focuses on learning transitions of math expressions 
across solution steps instead of learning representations of 
only the final answer. 


In domains other than math education, there exist methods 
for automated feedback generation, including programming 
[30, 31, 40] and essays [35]. However, transferring these 
methods to math solutions is not trivial since i) open-ended 
math solutions are less structured than programming code 
and ii) data-driven representations of math symbols have not 
been developed until recently [46] whereas such representa- 
tions have been studied for a long time in natural language 
processing [6, 7, 26]. 


Another body of remotely-related work focuses on using 
computer vision techniques to identify math expressions 
from images for similar math expression retrieval [29], turning
hand-written math expressions into LaTeX [47], and
automatically identifying and correcting student errors [14].
These works often bypass the inherent structure of math 
expressions and directly use an end-to-end model for their 
tasks, which means that they cannot be used to analyze 
student knowledge. Nevertheless, these techniques can be 
used to build large-scale datasets containing hand-written 
student solutions which we can use in the future. 


3. BACKGROUND: EMBEDDING MATH 
EXPRESSIONS INTO VECTOR SPACES 


In this section, we provide an overview of a recent method 
that we developed to embed math expressions into a vec- 
tor space, i.e., a math embedding space. Doing so turns 
discrete, symbolic math expression representations into con- 
tinuous, distributed representations [2], which enables us to 
manipulate math expressions in a manner compatible with 
modern machine learning methodologies. 


Our embedding method is a tree-structured encoder illus- 
trated in Figure 2. The key observation is that any math 
expression has a corresponding symbolic tree-structured rep- 
resentation in the operator tree format. In the operator tree, 
the non-terminal (non-leaf) nodes are math operators, e.g.,
addition and subtraction, and the terminal (leaf) nodes are
numbers or variables; see Figure 2 for an illustration. Thus, an
operator tree explicitly captures the semantic and structural 
properties of a math expression. A number of existing works 
have demonstrated the superior performance of using oper- 
ator tree representations of math expressions compared to 
other math expression representations in applications such 
as automatic math word problem solving [32, 48, 51] and 
math formulae retrieval [5, 25, 49, 50]. 


Therefore, we built a math expression encoder that lever- 
ages the operator tree representation of math expressions. 
Specifically, during the encoding process, it first converts a 
math expression into its corresponding tree format, using the 
parser introduced in [5]. It then linearizes the tree via
depth-first search, which lets us process nodes as a sequence in
which each math symbol is associated with its own trainable 
embedding. Next, it leverages positional encoding, similar 
to [44, 38], to retain the relative position of each node in the 
tree. The output of our encoder is a fixed-dimensional em- 
bedding vector that represents the input math expression, 
which we will use to learn representations of math operations 
for the math operation classification and feedback prediction 
tasks. We pretrained the encoder on a large corpus of math
expressions extracted from Wikipedia and arXiv articles and
demonstrated superior performance in reconstructing math
expressions (and scientific formulae) and retrieving similar 
expressions. See the anonymized version of our work at [46]. 
We will refer to the trained encoder as the math expression 
encoding method in what follows. 
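
The linearization step described above can be sketched as follows. This is a minimal illustration only: the `Node` class, the hand-built tree for x = 2y − 4, and the choice of a pre-order traversal are assumptions made for the example, not the parser of [5] or the encoder of [46].

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A node in an operator tree: an operator, a number, or a variable."""
    symbol: str
    children: List["Node"] = field(default_factory=list)


def linearize(root: Node) -> List[str]:
    """Pre-order depth-first traversal: a node first, then its children left to right."""
    out = [root.symbol]
    for child in root.children:
        out.extend(linearize(child))
    return out


# Hand-built operator tree for x = 2y - 4 (hypothetical example).
tree = Node("=", [
    Node("x"),
    Node("-", [Node("*", [Node("2"), Node("y")]), Node("4")]),
])

tokens = linearize(tree)  # ["=", "x", "-", "*", "2", "y", "4"]

# Each distinct symbol gets an index into a trainable embedding table.
vocab = {sym: i for i, sym in enumerate(sorted(set(tokens)))}
token_ids = [vocab[s] for s in tokens]
```

The resulting token sequence loses the tree shape on its own, which is why the encoder additionally needs positional information for each node.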


4. LEARNING REPRESENTATIONS OF 
MATH OPERATIONS 


In this section, we detail methods we use to learn both im- 
plicit and explicit math operation representations by study- 
ing how they transform math expressions in each solution 
step in the math embedding space. In these methods, we 
leverage the math expression encoding method developed in 
our prior work that we reviewed above to embed math ex- 
pressions into vectors and work with these embedding vec- 
tors. However, since these embeddings are trained on math 
expressions that are very different from those occurring in 
actual student solution steps, we use an additional train- 
able, fully-connected neural network to adapt these embed- 
dings to our dataset, following a popular approach in natural 
language processing [13]. Specifically, we have $e = g_\eta(m)$
where m and e are the embedded vector of a math expression 




[Figure 2 pipeline: math expression → operator tree →
linearized tree → encoder → math embedding]


Figure 2. Illustration of the math expression encoding method that we employ in this work. 


in our dataset before and after the adaptation, respectively. 
$\eta$ denotes the set of parameters in the fully-connected
network that we will learn during the training process.


We define a step in a student's solution to open-ended math
questions as a tuple $(\mathcal{E}_1, \mathcal{E}_2, z)$, where $z \in Z$ is the math
operation applied in this step, with $Z$ denoting the set of
possible math operations. $\mathcal{E}_1 \in E$ and $\mathcal{E}_2 \in E$ denote the
math expressions involved in this step before and after
applying this math operation, i.e., the step can be expressed as
$\mathcal{E}_1 \rightarrow \mathcal{E}_2$. $E$ denotes the set of all unique math expressions
(across all steps in a dataset). For simplicity, we assume that
only one math operation is applied in each step; an extension
to cases where multiple math operations are applied is trivial
and will be discussed in what follows. $e_1 \in \mathbb{R}^D$ and $e_2 \in \mathbb{R}^D$ are the
fine-tuned embedding vectors that correspond to math
expressions $\mathcal{E}_1$ and $\mathcal{E}_2$, respectively, where $D$ is the dimension
of the embedding.


4.1 Math Operation Classification 
The first task we will study in this paper is to classify the
math operation applied in a solution step given the math
expression embeddings before and after applying it, $e_1$ and
$e_2$. The same notations and approaches also apply to our
second task, feedback classification. This task can simply be
solved using a supervised learning method, e.g., a regression
model where the predicted probability of a math
operation $z$ is given by

$$p(\hat{z} = z) = \mathrm{softmax}(v_z^T [e_1^T, e_2^T]^T),$$

where $\mathrm{softmax}(\cdot)$ is the softmax function for multi-class
classification [10]. $v_z$ is a parameter vector associated with each
math operation $z$, which is used to compute an inner product
with the concatenation of $e_1$ and $e_2$ before being fed into the
softmax function. On a training dataset with given tuples
$(e_1, e_2, z)$, we can learn the parameters $\{v_z\}$ by minimizing
the cross-entropy loss [10] between the predicted math
operation $\hat{z}$ and the actual math operation. This approach can
be seen as learning implicit representations of math
operations since they are captured by the classifier parameters.
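
To make the classifier concrete, here is a minimal NumPy sketch of the softmax prediction and the cross-entropy loss for one step. The dimensions, the random embeddings, and the function names are illustrative assumptions; in the paper the embeddings come from the adapted tree encoder and the parameters are learned by gradient descent.

```python
import numpy as np


def softmax(scores):
    """Numerically stable softmax over a vector of scores."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def predict_operation_probs(e1, e2, V):
    """V has one row v_z per math operation; each row has length 2D."""
    features = np.concatenate([e1, e2])  # the concatenation [e1, e2]
    return softmax(V @ features)         # one inner product per operation


def cross_entropy(probs, true_op):
    """Negative log-probability assigned to the observed operation."""
    return -np.log(probs[true_op])


rng = np.random.default_rng(0)
D, num_ops = 4, 7                        # toy sizes (assumed)
e1, e2 = rng.normal(size=D), rng.normal(size=D)
V = rng.normal(size=(num_ops, 2 * D))

probs = predict_operation_probs(e1, e2, V)
loss = cross_entropy(probs, true_op=2)
```

The same code applies to feedback classification by replacing the set of operations with the set of feedback types.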


4.2 Learning Math Operation Representations

The classification approach we detailed above can help us 
classify the math operation applied in a solution step but 
falls short on learning explicit representations of math oper- 
ations. The latter is important, however, to help us under- 
stand students’ math solution processes and diagnose their 
errors. We now detail a series of methods for us to learn 
explicit representations of math operations. 


4.2.1 Translating embeddings 


[Figure 3 content: the TransE panel places math expression and
math operation embeddings in a single space (eSpace); the TransR
panel projects expression embeddings via ReLU(M_z e) into an
operation-specific space (zSpace). Input pair: (E_1, E_2, z).
Example: (3x + 2x = 6, 5x = 6, COMBINE_ADD)]


Figure 3. Illustration of the TransE and TransR frameworks.
TransE puts the embeddings of equations $e_1$, $e_2$, and math
operation $z$ in the same embedding space, whereas TransR
puts them in their own embedding spaces.


We will leverage the translating embedding (TransE) framework
[3], which has found success in embedding entities
and characterizing relationships between entities in
multi-relational data. Our key assumption in this framework
is that math operations are linear and additive, i.e., the
relationship between the math expressions before and after a
math operation satisfies

$$e_2 \approx e_1 + h_z,$$

where $h_z \in \mathbb{R}^D$ is the embedding of the math operation $z$. In
other words, we assume that the effect of a math operation
is characterized by the difference in the embedding vectors
between the math expressions before and after it in a single
step; adding it to the embedded vector of $\mathcal{E}_1$ results in the
embedded vector of $\mathcal{E}_2$ after the step.


To learn these math operation embeddings from data, we
use two loss functions. The first loss function promotes this
linear and additive relationship between embeddings of the
math expressions and operations on the training data. To
this end, we define a distance function as
$d(e_1, e_2, h_z) = \|e_1 + h_z - e_2\|_2^2$ and define the loss function as

$$\mathcal{L}_1 = \sum_{(\mathcal{E}_1, \mathcal{E}_2, z)} d(e_1, e_2, h_z).$$


The second loss function pushes counterfeit step tuples that
are generated by replacing elements in an observed step tuple
with other ones in the dataset to not satisfy the
aforementioned linear and additive relationship. To this end, we
minimize the pairwise marginal distance ranking-based loss
given by

$$\mathcal{L}_2 = \sum_{(\mathcal{E}_1, \mathcal{E}_2, z) \in S} \ \sum_{(\mathcal{E}'_1, \mathcal{E}'_2, z') \in S'_{(\mathcal{E}_1, \mathcal{E}_2, z)}} [\gamma + d(e_1, e_2, h_z) - d(e'_1, e'_2, h_{z'})]_+,$$

where $[x]_+ = x$ when $x > 0$ and 0 otherwise and $\gamma > 0$
is a hyper-parameter that controls the margin of the distance
ranking. $S$ denotes the set of steps in the dataset and
$S'_{(\mathcal{E}_1, \mathcal{E}_2, z)}$ is a set of counterfeit steps that are perturbed
versions of the actual step $(\mathcal{E}_1, \mathcal{E}_2, z)$, generated by randomly
replacing one of the triplet elements in the step by a different
math expression or math operation from another step,
i.e.,

$$S'_{(\mathcal{E}_1, \mathcal{E}_2, z)} = A \cup B \cup C,$$

where

$$A = \{(\mathcal{E}'_1, \mathcal{E}_2, z) : \mathcal{E}'_1 \neq \mathcal{E}_1 \in E\},$$

$$B = \{(\mathcal{E}_1, \mathcal{E}'_2, z) : \mathcal{E}'_2 \neq \mathcal{E}_2 \in E\},$$

$$C = \{(\mathcal{E}_1, \mathcal{E}_2, z') : z' \neq z \in Z\}.$$


Intuitively speaking, our objective encourages the distance 
function calculated on an actual tuple in the dataset to be 
smaller than that calculated on a perturbed version of it. 
Figure 3 illustrates the whole process. 


The final loss function that we minimize is simply the
combination of these two loss functions, $\mathcal{L} = \mathcal{L}_1 + \mathcal{L}_2$. Using the
learned embeddings of each math operation, we can classify
them from the math expressions $\mathcal{E}_1$ and $\mathcal{E}_2$ using the nearest
neighbor classifier, i.e., $\hat{z} = \arg\min_z d(e_1, e_2, h_z)$.
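
The TransE pieces above (distance, margin ranking term for one counterfeit step, and nearest neighbor classification) can be sketched in a few lines of NumPy. The two-dimensional toy vectors, the squared Euclidean distance, and the function names are assumptions chosen for illustration; they are not learned embeddings.

```python
import numpy as np


def transe_distance(e1, e2, h):
    """Squared Euclidean norm of e1 + h - e2 (assumed form of d)."""
    diff = e1 + h - e2
    return float(diff @ diff)


def margin_ranking_term(d_true, d_fake, gamma=1.0):
    """Hinge term [gamma + d_true - d_fake]_+ for one counterfeit step."""
    return max(0.0, gamma + d_true - d_fake)


def classify_operation(e1, e2, H):
    """Nearest neighbor rule: the operation whose embedding best explains the step."""
    return min(H, key=lambda z: transe_distance(e1, e2, H[z]))


# Toy expression embeddings and operation embeddings (hypothetical values).
e1 = np.array([1.0, 0.0])
e2 = np.array([1.0, 2.0])
H = {
    "COMBINE_ADD": np.array([0.0, 2.0]),
    "DIV_SIDE": np.array([3.0, 0.0]),
}

d_true = transe_distance(e1, e2, H["COMBINE_ADD"])  # observed operation
d_fake = transe_distance(e1, e2, H["DIV_SIDE"])     # counterfeit: operation replaced
term = margin_ranking_term(d_true, d_fake, gamma=1.0)
predicted = classify_operation(e1, e2, H)
```

Here the counterfeit step is already farther away than the margin requires, so its hinge term is zero and it would contribute no gradient during training.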


4.2.2 Learning Entity and Relation Embeddings 
Despite potentially exhibiting excellent interpretability, 
TransE’s assumption that math operations are linear and 
additive in the math expression embedding space may be 
too restrictive. This assumption puts math operations as
vectors in the same latent space where similar math
expressions will be close to each other. However, different math
operations are fundamentally different and can transform
the same math expression into dramatically different math
expressions that are far apart in the embedding space. For
example, different math operations can focus on transforming
different parts of the same math expression. The steps
(3 + 5 + 2x = x + 1, 8 + 2x = x + 1, combine similar terms)
and (3 + 5 + 2x = x + 1, 3 + 5 + 2x − x = x + 1 − x,
subtract from each side) have the same starting math
expression $\mathcal{E}_1$. In the first step, only similar terms on the left
hand side of the equation are combined, regardless of the
other side of the equation. In the second step, we subtract
x from both sides of the equation, which is justified by
the equality symbol in the equation: what matters is that
the same term is subtracted from both sides, not what
exactly is on each side. Therefore, TransE's linear
and additive assumption means that the resulting $\mathcal{E}_2$ in
these steps will be very different due to the different math
operations applied, which conflicts with the observation that
they are very similar. To address this limitation, we explore
the Learning Entity and Relation Embeddings (TransR) [23] 
model, which models math expressions and math operations 
in different spaces, i.e., there will be a shared embedding 
space for all math expressions but separate relation spaces 
for different math operations. 


TransR learns the embeddings of math operations by projecting
math expression embeddings to their corresponding relation
spaces and then learning translations between those projected
expressions. For each math operation $z$, we set a projection matrix
$M_z \in \mathbb{R}^{D \times D}$ that projects a math expression to its relation
space. To make this projection nonlinear, we apply the
rectified linear unit (ReLU) activation function [10] to it and
define the corresponding distance function as

$$d_z(e_1, e_2, h_z) = \|\mathrm{ReLU}(M_z e_1) + h_z - \mathrm{ReLU}(M_z e_2)\|_2^2.$$


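Correspondingly, the two loss functions in the TransR framework
are given by

$$\mathcal{L}_1 = \sum_{(\mathcal{E}_1, \mathcal{E}_2, z)} d_z(e_1, e_2, h_z),$$

$$\mathcal{L}_2 = \sum_{(\mathcal{E}_1, \mathcal{E}_2, z) \in S} \ \sum_{(\mathcal{E}'_1, \mathcal{E}'_2, z') \in S'_{(\mathcal{E}_1, \mathcal{E}_2, z)}} [\gamma + d_z(e_1, e_2, h_z) - d_{z'}(e'_1, e'_2, h_{z'})]_+.$$

The projection matrices $M_z$, $\forall z \in Z$, are included as part of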
the trainable parameters. The rest of the training and re- 
sulting math operation classification procedure remains un- 
changed from the TransE framework. 
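
For comparison with TransE, the TransR distance and classification rule can be sketched as below. The square projection matrices, the squared Euclidean distance, and the toy values are assumptions for illustration only.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def transr_distance(e1, e2, h, M):
    """Project both expressions into the operation's space, then translate by h."""
    diff = relu(M @ e1) + h - relu(M @ e2)
    return float(diff @ diff)


def classify_operation(e1, e2, H, Ms):
    """Pick the operation whose own projection and translation best fit the step."""
    return min(H, key=lambda z: transr_distance(e1, e2, H[z], Ms[z]))


# Toy embeddings; each operation has its own projection matrix (hypothetical values).
e1 = np.array([1.0, -1.0])
e2 = np.array([2.0, 1.0])
Ms = {
    "SUB_SIDE": np.eye(2),
    "DISTRIBUTE": np.array([[0.0, 1.0], [1.0, 0.0]]),
}
H = {
    "SUB_SIDE": np.array([1.0, 1.0]),
    "DISTRIBUTE": np.array([0.0, 0.0]),
}

d_sub = transr_distance(e1, e2, H["SUB_SIDE"], Ms["SUB_SIDE"])
d_dist = transr_distance(e1, e2, H["DISTRIBUTE"], Ms["DISTRIBUTE"])
predicted = classify_operation(e1, e2, H, Ms)
```

Because every operation measures distance in its own projected space, the same pair of expression embeddings can be close under one operation and far under another, which a single shared space cannot express.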


5. EXPERIMENTS 


We now detail a series of quantitative and qualitative exper- 
iments that we have conducted to validate the learned rep- 
resentations of math operations. Using the Cognitive Tutor 
2010 equation solving (CogTutor) dataset,² we focus on two
tasks: i) classifying the math operation a student applies in 
a solution step and ii) classifying the feedback category cor- 
responding to certain types of incorrect steps, from the math 
expressions the student enters before and after the step. 


5.1 Dataset 

We use the CogTutor dataset which we accessed via the 
PSLC DataShop [19]. The dataset contains detailed tu- 
tor logs generated as students in a school use the Cog- 
nitive Tutor system [34] for their Algebra I class. These 
logs contain the students' step-by-step solutions to equation
solving problems, where each step is a tuple with
three elements: a math expression $\mathcal{E}_1$ at the beginning of
the step, the step name $z$, i.e., the math operation the
student selected to apply to this math expression, and
the resulting math expression $\mathcal{E}_2$ after the step. Students
can select math operations from a built-in list in Cognitive
Tutor: COMBINE_ADD, COMBINE_MUL, ADD_SIDE,
SUB_SIDE, MUL_SIDE, DIV_SIDE, and DISTRIBUTE; see
Table 1 for an illustration of these operations and some
examples of the corresponding math expressions before and
after them in a step.


There are a total of 50,406 steps in this dataset that can be 
further divided into three subsets according to their out- 
comes: OK (43,413 steps), ERROR (6,377 steps), and BUG 
(5,744 steps). The OK subset contains steps that are cor- 
rect, i.e., the student both selected the correct math op- 
eration and arrived at the correct math expression. The 
BUG and ERROR subsets contain incorrect student steps, ei- 
ther because the operation they selected was incorrect or 


²https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=660
Step (Math operation) | Description | Example
COMBINE_ADD | combine two similar terms with add/sub operator | 3x + 2x → 5x
COMBINE_MUL | combine two similar terms with multiply/divide operator | x * x → x^2
ADD_SIDE | add a math term on each side | x = 1 → x + 1 = 1 + 1
SUB_SIDE | subtract a math term on each side | x = 1 → x − 1 = 1 − 1
MUL_SIDE | multiply a math term on each side | x = 1 → x * 2 = 2
DIV_SIDE | divide a math term on each side | x = 1 → x/2 = 1/2
DISTRIBUTE | distribute (expand) the terms | (x + 1)x → x * x + x


Table 1. Detailed descriptions and examples for each math operation in the CogTutor dataset. 


because they selected the correct operation but did not apply
it correctly, i.e., arriving at an incorrect math expression
after the step. The difference between these two subsets is
that BUG contains steps that fit one of the predefined error
templates in the Cognitive Tutor system; in this case,
the system can automatically diagnose the error and deploy
predefined feedback. On the other hand, ERROR contains
incorrect steps for which Cognitive Tutor could not
automatically diagnose the underlying error. The OK subset can be
further split into six predefined difficulty levels (named
ES_01, ES_02, ES_03, ES_04, ES_05, and ES_07), with 2,068,
7,546, 8,183, 13,393, 5,484, and 2,801 steps, respectively.
We do not further split the BUG and ERROR subsets for the 
math operation classification task due to their limited sizes. 


To learn the representation of math operations, we need 
examples of how they transform one math expression into 
another. However, the CogTutor dataset may not contain 
enough data that is rich in both quantity and diversity for 
neural network-based models to learn from. Therefore, we 
designed a synthetic data generator stemming from the math 
question answering dataset created by DeepMind [36]. The 
generator can generate steps by first generating the initial 
math expression and then applying math operations listed 
in Table 1 to arrive at a resulting math expression. We 
have full control over the generated steps through the en- 
tropy, degree, and flip parameters. Increasing entropy intro- 
duces more complexity to the math expressions as numer- 
ical constants generated get larger. Increasing the degree 
parameter introduces monomials of higher degrees and also 
adds more terms in the math expression. Finally, the flip 
parameter allows us to control which side of an equation 
has a higher chance to be more complicated than the other. 
Tuning these parameters within this flexible synthetic data
generation method enables us to generate a large number of
steps that closely resemble those in the CogTutor dataset.
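
As a rough illustration of what generating labeled steps looks like, the toy sketch below produces (before, after, operation) triples for simple linear equations. It is not the generator described above: the `make_step` function, its `max_coeff` argument (which only loosely plays the role of the entropy parameter), and the three templates are hypothetical, and the degree and flip parameters are not modeled.

```python
import random


def make_step(rng, max_coeff=9):
    """Return one (before, after, operation) triple for a toy linear equation."""
    a, b, c = (rng.randint(1, max_coeff) for _ in range(3))
    op = rng.choice(["COMBINE_ADD", "ADD_SIDE", "DIV_SIDE"])
    if op == "COMBINE_ADD":
        before = f"{a}x+{b}x={c}"
        after = f"{a + b}x={c}"                # merge the two x terms
    elif op == "ADD_SIDE":
        before = f"{a}x-{b}={c}"
        after = f"{a}x-{b}+{b}={c}+{b}"        # add b to both sides
    else:  # DIV_SIDE
        before = f"{a}x={c}"
        after = f"{a}x/{a}={c}/{a}"            # divide both sides by a
    return before, after, op


rng = random.Random(0)                          # fixed seed for reproducibility
steps = [make_step(rng, max_coeff=9) for _ in range(5)]
```

Larger values of `max_coeff` yield larger numerical constants, which is one simple way to make the generated expressions harder.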


5.2 Methods


To fully evaluate the effectiveness of our math operation 
representations, we also experiment with two other ways of 
encoding math expressions commonly used in natural lan- 
guage processing tasks, in addition to the tree embedding- 
based and translation-based encoder that we introduced in 
Section 4.2. These two encoders include a gated recurrent 
unit (GRU)-based encoder [4] and a convolutional neural 
network (CNN)-based encoder [17]; we will use the output
of these encoders to replace $[e_1^T, e_2^T]^T$ as input to the
classifier detailed in Section 4.1.


Specifically, these two encoders first concatenate the two 
math expressions before and after the step, i.e., 
$\mathcal{E} = [\mathcal{E}_1, \mathcal{E}_2]$. For each character 
$x_t$ in $\mathcal{E}$, we compute its embedding 

$$\mathbf{x}_t = W^T \mathrm{onehot}(x_t),$$

where $W$ is a trainable embedding matrix. Using these 
character embeddings, the GRU encoder computes 

$$h_t = \mathrm{GRU}_\theta(\mathbf{x}_t, h_{t-1}),$$

where $\theta$ represents all the trainable parameters in the GRU. 
We then replace $[e_1^T, e_2^T]^T$ with $h_T$ as input to the classifier, 
where $T$ is the total number of characters in $\mathcal{E}$. Similarly, 
the CNN encoder computes 

$$h = \mathrm{max\_pool}(\mathrm{CNN}_\phi([\mathbf{x}_1, \ldots, \mathbf{x}_T])),$$

where $\mathrm{CNN}_\phi$ represents a 2D CNN with parameters $\phi$ and 
max_pool is a 1D max pooling operator. Combined, they 
return a fixed dimensional feature vector $h$ that replaces 
$[e_1^T, e_2^T]^T$ as input to the classifier. For each of these two 
models, we learn its parameters jointly with the classification 
task using the cross-entropy loss that we described in 
Section 4.1. 
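
To make the two character-level encoders concrete, the sketch below implements the embedding lookup, a GRU recurrence, and max pooling over time in plain Python. It is only an illustration under several assumptions: biases are omitted, the GRU gates follow one common formulation, the CNN is simplified to a one-dimensional sliding window with ReLU (rather than a 2D CNN), the two expressions are concatenated without a separator, and all weights are random rather than trained.

```python
import math
import random

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def embed(text, vocab, W):
    # W^T onehot(x_t) selects the row of W indexed by character x_t.
    return [W[vocab[ch]] for ch in text]

def gru_step(x, h, p):
    """One update h_t = GRU(x_t, h_{t-1}); biases omitted for brevity."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["Wz"], x), matvec(p["Uz"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["Wr"], x), matvec(p["Ur"], h))]
    rh = [ri * hi for ri, hi in zip(r, h)]
    n = [math.tanh(a + b) for a, b in zip(matvec(p["Wn"], x), matvec(p["Un"], rh))]
    return [(1 - zi) * hi + zi * ni for zi, hi, ni in zip(z, h, n)]

def gru_encode(xs, p, hidden):
    h = [0.0] * hidden
    for x in xs:
        h = gru_step(x, h, p)
    return h  # final hidden state h_T

def conv_features(xs, F, width):
    # Slide a window over `width` consecutive embeddings; each row of F
    # is one filter, followed by a ReLU.
    out = []
    for t in range(len(xs) - width + 1):
        window = [v for x in xs[t:t + width] for v in x]
        out.append([max(0.0, a) for a in matvec(F, window)])
    return out

def max_pool_over_time(feature_maps):
    # Take the maximum of every feature coordinate over all positions.
    return [max(col) for col in zip(*feature_maps)]

rng = random.Random(0)
E = "x+5=9" + "x=4"  # expressions before and after a step, concatenated
vocab = {ch: i for i, ch in enumerate(sorted(set(E)))}
d, hidden = 4, 3
W = rand_matrix(rng, len(vocab), d)
params = {k: rand_matrix(rng, hidden, d if k[0] == "W" else hidden)
          for k in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}
xs = embed(E, vocab, W)
h_gru = gru_encode(xs, params, hidden)
h_cnn = max_pool_over_time(conv_features(xs, rand_matrix(rng, 5, 2 * d), 2))
```

Both `h_gru` and `h_cnn` have a fixed length regardless of the number of characters, which is what allows either vector to be fed to the classifier.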


Overall, we test five different methods for the math oper- 
ation classification and feedback classification tasks. The 
first three methods use different encoding methods in con- 
junction with a classifier: i) using the GRU encoder to en- 
code math expressions as input to the classifier, which we 
dub GRU+C, ii) using the tree embedding-based encoder in- 
stead, which we dub TE+C, and iii) using the CNN encoder 
instead, which we dub CNN+C. These methods do not learn 
explicit representations of math operations. The next two 
methods use the TransE and TransR frameworks to learn 
these representations using tree embeddings: iv) using tree 
embedding-based encoder as input to the TransE framework 
in conjunction with a nearest neighbor classifier, which we 
dub TE+TransE, and v) using the TransR framework in- 
stead of the TransE framework to study math operations in 
multiple relation spaces, which we dub TE+TransR. 
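
As an illustration of how a nearest neighbor classifier can sit on top of TransE-style embeddings, the sketch below assumes the standard TransE form, in which each operation has a vector $r$ with $e_1 + r \approx e_2$, and predicts the operation whose vector minimizes the Euclidean distance. The operation vectors and expression embeddings are made-up toy values, and the function name is hypothetical.

```python
import math

def transe_predict(e1, e2, op_embeddings):
    """Return the operation whose vector r best satisfies e1 + r = e2."""
    def dist(r):
        return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(e1, r, e2)))
    return min(op_embeddings, key=lambda name: dist(op_embeddings[name]))

# Toy 2-D operation embeddings (illustrative values only).
ops = {
    "ADD_SIDE": [1.0, 0.0],
    "SUB_SIDE": [-1.0, 0.0],
    "COMBINE_ADD": [0.0, 1.0],
}
e_before, e_after = [0.2, 0.3], [-0.7, 0.35]
pred = transe_predict(e_before, e_after, ops)  # closest translation wins
```

Here the observed change `e_after - e_before` is close to the SUB_SIDE vector, so that operation is returned.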


5.3. Experimental Setup 

We first test our math operation representation learning 
methods on the OK subset via 5-fold cross-validation, i.e., 
training on 80% of steps in the subset to learn representa- 
tions of math operations and testing them on the remaining 
20%. We also test the generalizability of the learned 
representations to incorrect steps, i.e., we replace the test set 
with the ERROR and BUG subsets and check whether we can still 
recognize the math operation a student applied in an incorrect 
step. The results are detailed in Section 5.4.1. 
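
The cross-validation protocol can be sketched as follows. The helper names and the `evaluate` callback (which would train on the given training indices and return the test accuracy) are hypothetical, and the use of the sample standard deviation is an assumption.

```python
import random
import statistics

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cross_validate(n, k, evaluate):
    """Return mean and standard deviation of evaluate() over the k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_indices(n, k)]
    return statistics.mean(scores), statistics.stdev(scores)
```

With k = 5, each fold trains on 80% of the steps and tests on the remaining 20%; testing on the ERROR or BUG subset instead only changes which steps `evaluate` scores.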


Since the distributions of math expressions in the OK, ERROR, 
and BUG data subsets are mostly similar with minor differences, 
the previous experiment does not give us a good idea 
of the generalization ability of our math operation 
representation learning methods. Therefore, we further divide 
the OK subset into six smaller subsets, each corresponding to 
a different difficulty level (with different structure and 
complexity) according to the questions within it, and test the 
generalizability of the learned math operation representations. 
The results are detailed in Section 5.4.2. In practice, 
solution step data generated by real students is often limited. 
Therefore, we conduct two more experiments to test whether 
synthetically generated steps can help us learn math operation 
representations that generalize to real data. First, we 
repeat the experiments above using synthetically generated 
steps as the training set. This synthetic training set consists 
of 1,000 steps for each math operation defined in Table 1 
(adding up to a total of 7,000 across different difficulty 
levels). The results are detailed in Section 5.4.3. Second, to 
study the impact of synthetically generated data when real 
data is limited, we pre-train the math operation 
representations with synthetic data, fine-tune on a small amount 
of real data from each difficulty level in the OK subset, and 
test on the rest. The results are detailed in Section 5.4.4. 


Method    | OK           | ERROR        | BUG 
GRU+C     | 99.18 ± 0.23 | 93.87 ± 0.66 | 95.89 ± 0.63 
TE+C      | 99.82 ± 0.04 | 93.30 ± 0.65 | 95.38 ± 0.62 
CNN+C     | 95.37 ± 0.44 | 86.82 ± 1.38 | 91.02 ± 0.59 
TE+TransE | 96.27 ± 0.17 | 86.32 ± 1.23 | 84.21 ± 2.13 
TE+TransR | 99.17 ± 0.21 | 91.28 ± 1.12 | 91.31 ± 1.87 


Table 2. Math operation classification accuracy for all 
methods trained on the OK subset of the CogTutor dataset and 
tested on different data subsets. Accuracy is high across 
the board, while GRU-based encoding and tree embedding-based 
encoding in conjunction with a classifier result in the 
best performance. 


To test the ability of our learned math operation 
representations to recognize student errors, we use them to classify 
the feedback types provided by CogTutor in the BUG data subset. 
Examples of such errors include calculating the wrong 
simplification result, using the wrong sign in front of terms, 
and applying useless or illogical steps to solve the problem. 
The results are detailed in Section 5.4.5. 


We use the Adam optimizer [18] with a learning rate of 0.001 
and a batch size of 64, and run 10 training epochs for each 
experiment. The math expression encoder outputs a length-512 
embedding vector for each math expression, which we map to a 
length-32 embedding vector using a trainable fully-connected 
neural network. All of our experiments were conducted on a 
server with a single Nvidia RTX8000 GPU. 


5.4 Results and Discussion 


5.4.1 Generalizing to incorrect steps 

Table 2 shows the averages and standard deviations of math 
operation classification accuracy for every method we 
experimented with using the OK subset as the training set. As 
expected, testing on the ERROR and BUG subsets results in 
slightly lower (5-10%) math operation classification accuracy 
for all methods since the training set does not contain 
incorrect steps. However, even on steps that are incorrect, 
these methods can still effectively identify the math 
operation a student intended to apply (with up to 95% accuracy), 
suggesting that they may be applicable to solutions to fully 
open-ended questions that are not highly structured, unlike 
those in Cognitive Tutor, to provide feedback to teachers 
on students’ solution approaches. 


We observe that using GRUs and tree embeddings as repre- 
sentations for math expressions and applying a classification 
method on top of these representations result in similar per- 
formances; GRUs slightly outperform tree embeddings in 
cases where we use the ERROR and BUG subsets as the test 
set while tree embeddings slightly outperform GRUs in the 
case where we use a part of the OK subset as the test set. 
Using CNNs to encode math expressions as input to a clas- 
sifier results in worse performance, suggesting that they do 
not capture the semantic and structural information in math 
expressions as well as GRUs and tree embeddings. As ex- 
pected, using tree embeddings under the TransE and TransR 
frameworks leads to worse performance than the first two 
methods, with TransE achieving low performance (especially 
on the BUG subset) and TransR achieving comparable per- 
formance to the classification-based methods on the OK sub- 
set but lower performance on the ERROR and BUG subsets. 
This result can be explained by the additional structural 
restriction that math operations are represented as linear 
and additive in some embedding space in the TransE frame- 
work, which makes it less robust against incorrect student 
solution steps. Using the TransR framework mitigates this 
problem due to its use of different relation spaces for each 
math operation. 
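
The effect of relation-specific spaces can be illustrated with a small sketch of the standard TransR score, $\| M_r e_1 + r - M_r e_2 \|$, where $M_r$ projects expression embeddings into the space of operation $r$. The operation names, matrices, and embeddings below are toy values chosen for illustration and are not learned from data.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def transr_distance(e1, e2, M_r, r):
    """|| M_r e1 + r - M_r e2 || in the relation space of one operation."""
    p1, p2 = matvec(M_r, e1), matvec(M_r, e2)
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(p1, r, p2)))

def transr_predict(e1, e2, ops):
    """ops maps an operation name to (projection matrix, translation)."""
    return min(ops, key=lambda name: transr_distance(e1, e2, *ops[name]))

# Toy setup: the third coordinate plays the role of error-induced noise.
ops = {
    # OP_A projects onto the first two coordinates and ignores the third.
    "OP_A": ([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [1.0, 0.0]),
    # OP_B only looks at the third coordinate.
    "OP_B": ([[0.0, 0.0, 1.0]], [4.0]),
}
e_before, e_after = [0.0, 0.0, 0.0], [1.0, 0.0, 5.0]
pred = transr_predict(e_before, e_after, ops)
```

In this toy case, a single shared space with translations [1, 0, 0] and [0, 0, 4] would place the step at distance 5 from the first operation and about 1.41 from the second, whereas the projection used by OP_A discards the noisy coordinate and gives distance 0.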


These methods perform similarly in the math operation 
classification task on real data largely due to the limited 
variation and complexity in the math expressions. The Cognitive 
Tutor system limits the degrees of freedom in a student’s 
response by splitting an open-ended step into the separate 
actions of selecting a single math operation and entering 
the resulting math expression, which limits the variability 
in the data. In the next experiment, we see that when we 
control for different levels of complexity in these math 
expressions and force these methods to generalize across 
complexities, their performance varies significantly. 


(b) Euclidean distance between learned math 
operation embedding vectors. 


Figure 4. Details of TE+TransE for the math operation 
classification task on the OK subset. These results match 
our intuition on how these math operations are related. 


Figure 4 visualizes the confusion matrix for math operation 
classification on the OK subset and the pairwise Euclidean 
distances between math operation embeddings learned via 
the TransE framework using tree embeddings for math 
expressions. Rows correspond to the true math operations 
applied in steps and columns correspond to predicted ones. 
Percentages in the confusion matrix (Figure 4a) are 
normalized w.r.t. the number of appearances of each math 
operation. We see that our math operation representation 
learning method captures some meaning of these operations 
(Figure 4b); the learned math operation embeddings capture 
the structural changes in math expressions in ways that match 
our intuition. For instance, both COMBINE_ADD and 
COMBINE_MUL can be considered types of simplification, so 
the Euclidean distance between the learned embeddings for 
these two operations is low, which is not surprising given 
the similar nature of these operations. Moreover, 
COMBINE_ADD, COMBINE_MUL, and DISTRIBUTE are often confused 
with one another. These results are also validated by a 2-D 
visualization (using t-SNE [42] as a dimensionality reduction 
method) of the learned math operation embeddings in Figure 5, 
where different math operations are mostly well separated 
except for COMBINE_ADD, COMBINE_MUL, and DISTRIBUTE. One 
possible explanation is that these operations are all applied 
to one side of the equation during a solution step, leaving 
the other side unchanged, while the other operations, such as 
ADD_SIDE, SUB_SIDE, MUL_SIDE, and DIV_SIDE, are all applied 
to both sides of the equation. Therefore, this result suggests 
that tree embeddings enable us to characterize a math 
operation by the structural change in math expressions before 
and after a solution step where it is applied. Furthermore, 
the classification accuracy for the DISTRIBUTE operation is 
significantly lower than that for other operations. This 
result is likely due to the fact that the number of steps with 
this operation is significantly lower than that for other 
operations. 
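
The two quantities visualized in Figure 4 can be computed as in the sketch below: a confusion matrix whose rows are normalized by the number of appearances of each true operation, and the pairwise Euclidean distances between operation embeddings. The function names and the toy labels are assumptions for illustration.

```python
import math
from collections import Counter

def row_normalized_confusion(true_ops, pred_ops, labels):
    """Percentage of each true operation (row) predicted as each label."""
    counts = Counter(zip(true_ops, pred_ops))
    matrix = {}
    for t in labels:
        total = sum(counts[(t, p)] for p in labels)
        matrix[t] = {
            p: (100.0 * counts[(t, p)] / total if total else 0.0)
            for p in labels
        }
    return matrix

def pairwise_euclidean(embeddings):
    """Distance between every pair of operation embedding vectors."""
    names = list(embeddings)
    return {(a, b): math.dist(embeddings[a], embeddings[b])
            for a in names for b in names}

# Toy predictions (not real results).
labels = ["COMBINE_ADD", "COMBINE_MUL", "DISTRIBUTE"]
true_ops = ["COMBINE_ADD"] * 4 + ["DISTRIBUTE"] * 2
pred_ops = ["COMBINE_ADD"] * 3 + ["COMBINE_MUL", "DISTRIBUTE", "COMBINE_ADD"]
cm = row_normalized_confusion(true_ops, pred_ops, labels)
```

Row normalization makes operations with few examples, such as DISTRIBUTE, directly comparable to frequent ones.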


5.4.2 Generalizing to different difficulty levels 

In this experiment, we test the ability of our learned math 
operation representations to generalize to math expressions 
with different levels of complexity in questions at different 
levels of difficulty. Although they are all about equation 
solving, questions at different difficulty levels in Cognitive 
Tutor involve math expressions that look very different. 
For example, in the easiest level (ES_01), the equation 
that needs to be solved in a question looks like x + 5 = 9, 
with only a single variable and without decimal numbers. 
In contrast, in the hardest level (ES_07), a question 
may contain coefficients with several decimal places 
and multiple variables, such as solving for m in the equation 
m(k - n) = gs. We only compare the GRU-based encoder 
and the tree embedding-based encoder in conjunction with 
a classifier since they are the best performing methods in 
the previous experiment. Table 3 lists the math operation 
classification accuracy for both methods after training on 
steps at different difficulty levels in the OK subset and testing 
on steps at other difficulty levels (including incorrect ones). 
We see that TE+C outperforms GRU+C in almost 
every case. This result suggests that tree embeddings are 
effective at capturing the structural properties of a math 
expression. As a result, math operation representations based 
on tree embeddings excel at capturing the structural change 
in math expressions before and after applying a math 
operation, leading to better generalizability than GRU-based 
encoding, which does not explicitly account for this change. 


Figure 5. Visualization of learned math expression change 
for a randomly sampled subset of student solution steps in 
2-D and corresponding operations (best viewed in color). 


5.4.3 Generalizing to different data distributions 

In this experiment, we test the ability of our methods to 
generalize from synthetically generated data to real student 
data. We train different math operation classification 
methods on the 7,000 synthetically generated steps and test them 
on steps generated by real students in the CogTutor dataset. 
Table 4 shows the mean and standard deviation for each 
method on each real data subset. We see that TE+C 
significantly outperforms GRU+C and CNN+C on all data 
subsets, which is in stark contrast to the previous experiment 
where the difference in performance across all methods is 
much smaller. This observation suggests that tree 
embeddings are more effective at capturing the semantic/structural 
effect of math operations on math expressions, thus 
generalizing better to different data distributions. Indeed, although 
the synthetically generated steps and the real steps have the 
same set of math operations, the distributions of numbers 
(1, 0.5, -7, etc.) and variables (x, u, t, etc.) differ, resulting 
in a mismatch between the data distributions. Tree 
embedding-based methods benefit from the tree-based representations 
of math expressions that can effectively capture structural 
information, making it easy for the learned embeddings of 
math expressions to generalize to unseen data. 


Trained on | Method | OK           | ERROR        | BUG 
ES_01      | GRU+C  | 58.82 ± 1.12 | 63.74 ± 1.13 | 66.02 ± 1.12 
           | TE+C   | 76.51 ± 0.62 | 84.24 ± 0.87 | 67.49 ± 1.10 
ES_02      | GRU+C  | 71.05 ± 1.12 | 76.66 ± 1.11 | 69.01 ± 1.14 
           | TE+C   | 87.89 ± 0.34 | 93.96 ± 0.72 | 80.44 ± 0.78 
ES_03      | GRU+C  | 82.39 ± 3.93 | 79.24 ± 1.47 | 80.01 ± 1.67 
           | TE+C   | 90.79 ± 1.12 | 93.83 ± 1.32 | 84.70 ± 1.54 
ES_04      | GRU+C  | 76.72 ± 0.14 | 71.35 ± 6.12 | 83.32 ± 2.24 
           | TE+C   | 94.65 ± 0.12 | 92.72 ± 1.32 | 90.99 ± 1.72 
ES_05      | GRU+C  | 81.74 ± 0.33 | 73.36 ± 1.69 | 78.36 ± 1.07 
           | TE+C   | 87.66 ± 0.25 | 80.00 ± 1.32 | 77.81 ± 0.99 
ES_07      | GRU+C  | 76.25 ± 3.21 | 73.15 ± 3.42 | 67.35 ± 3.62 
           | TE+C   | 79.44 ± 0.62 | 79.29 ± 0.72 | 72.53 ± 2.26 


Table 3. Math operation classification accuracy after 
training on steps with different difficulty levels and testing on the 
OK, ERROR, and BUG subsets. Tree embedding-based encoding 
outperforms GRU-based encoding. 


[Figure 6 plot: classification accuracy (vertical axis, ticks from 
40 to 100) versus number of real steps (horizontal axis: 10, 20, 50, 
100, 200, 400, 800), with one curve each for OK, ES_01, ES_02, 
ES_03, ES_04, ES_05, and ES_07.] 

Figure 6. Math operation classification accuracy for the 
TE+C method when real data is limited. Using synthetically 
generated steps as a starting point, we already start 
with acceptable classification accuracy even with few real 
steps generated by students. The performance steadily 
improves as more real data becomes available. 


5.4.4 Generalizing from synthetic data 

Method    | OK           | ERROR        | BUG 
GRU+C     | 62.89 ± 3.93 | 64.06 ± 4.70 | 62.94 ± 2.24 
TE+C      | 83.79 ± 0.14 | 75.49 ± 0.90 | 75.16 ± 0.55 
CNN+C     | 51.12 ± 1.64 | 45.52 ± 0.98 | 59.82 ± 1.68 
TE+TransE | 80.17 ± 2.32 | 71.86 ± 3.24 | 72.32 ± 2.72 
TE+TransR | 82.22 ± 2.88 | 73.83 ± 3.46 | 74.85 ± 3.23 


Table 4. Math operation classification accuracy for all 
methods trained on 7,000 synthetically generated steps and 
tested on different subsets of the CogTutor dataset. Tree 
embedding-based methods significantly outperform other 
methods, showing better ability to generalize to different 
data distributions. 


Figure 7. Math operation classification accuracy (difference 
in percentage) for TE+C, pre-training on synthetic data before 
fine-tuning on real data versus training only on real data. 
When real data is limited, pre-training on synthetic data 
results in significantly better performance. 


Ideally, if there is a large amount of training data, i.e., steps 
generated by real students containing different types of math 
expressions and detailed labels on these steps such as the 
math operation(s) applied, the error(s) if a step is incorrect, 
and corresponding feedback, we can simply use that data 
to learn our math operation representations. However, in 
practice, the amount of real data is often limited. Figure 6 
plots the performance of TE+C on all subsets of the 
CogTutor dataset, training on a portion of the steps in each 
subset and testing on the rest. We see that the performance 
on math operation classification suffers considerably 
when we only have limited training data. Therefore, 
synthetically generated data can play a vital role in improving 
performance under this circumstance; the strategy of 
fine-tuning models trained on synthetically generated data 
using a small amount of real data can be effective. 
Specifically, we start with a math operation classification 
model pre-trained on the 7,000 synthetically generated steps and 
fine-tune it on a small number of real steps by running 
gradient descent on these steps for 10 epochs. Figure 7 plots the 
improvement in math operation classification accuracy for 
the fine-tuned model over the model that trains only on real 
data, for various amounts of real data on all data subsets. We 
see that the pre-trained models always perform better, with 
significant improvement when the real data is extremely limited. 
This result suggests that i) effectively leveraging 
synthetically generated data can mitigate the problem of limited 
real data and ii) our math operation representation 
learning methods are capable of generalizing across different data 
distributions (synthetic → real). 
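
The pre-train then fine-tune recipe can be illustrated with a deliberately tiny stand-in model. The sketch below uses a one-dimensional logistic regression trained by full-batch gradient descent instead of the tree embedding-based classifier, and the synthetic and real data points are invented toy values; only the training schedule (pre-train on plentiful synthetic data, then run 10 epochs on a few real examples) mirrors the procedure described above.

```python
import math

def loss(w, data):
    """Mean logistic loss for labels y in {0, 1}; w = [weight, bias]."""
    total = 0.0
    for x, y in data:
        z = w[0] * x + w[1]
        total += math.log(1.0 + math.exp(-z if y else z))
    return total / len(data)

def train(w, data, epochs=10, lr=0.1):
    """Full-batch gradient descent starting from the given weights."""
    w = list(w)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w[0] * x + w[1])))
            g0 += (p - y) * x
            g1 += (p - y)
        w[0] -= lr * g0 / len(data)
        w[1] -= lr * g1 / len(data)
    return w

# Toy data: plentiful "synthetic" points and only a few "real" ones.
synthetic = [(x / 10, int(x > 0)) for x in range(-20, 21) if x != 0]
real_train = [(-1.5, 0), (-0.2, 0), (0.4, 1), (1.2, 1)]
real_test = [(-1.0, 0), (-0.5, 0), (0.3, 1), (0.9, 1), (1.8, 1)]

pretrained = train([0.0, 0.0], synthetic, epochs=100, lr=0.5)
fine_tuned = train(pretrained, real_train, epochs=10)    # pre-train + fine-tune
scratch = train([0.0, 0.0], real_train, epochs=10)       # real data only
```

Comparing `loss(fine_tuned, real_test)` with `loss(scratch, real_test)` shows the same qualitative effect in this toy setting: with only a handful of real examples and a fixed epoch budget, starting from pre-trained weights gives a lower held-out loss.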


5.4.5 Feedback type classification 

In this experiment, we evaluate our math operation 
representation learning methods on the feedback type 
classification task. These feedback items were automatically 
deployed by Cognitive Tutor for incorrect steps in the BUG 
subset. We pre-processed these steps, grouped the detailed 
feedback items according to the student errors that each 
feedback item addresses, and narrowed them down to a total 
of 24 types that occur multiple times. We perform 5-fold 
cross-validation on this subset. Table 5 shows the averages 
and standard deviations of feedback classification accuracy 
for all methods on this task across the five folds. We see 
that due to the limited size of the BUG subset (only 5,744 
steps) and the high number of classes (24), all methods 
perform worse than they do on the math operation classification 
task. Specifically, we see that the tree embedding-based 
encoder in conjunction with a classifier performs best while 
GRU-based encoding also performs well. This result shows 
that although tree embeddings are superior at capturing the 
meaning of math expressions, their advantage over simple 
encoding methods such as GRU-based encoding decreases 
due to increased noise in the data; some math expressions 
submitted by students in incorrect steps are ill-posed and 
do not make sense. Using the TransE and TransR 
frameworks results in slightly worse performance than classifiers 
since these methods explicitly learn a representation for each 
math operation, which limits their performance on this task 
due to the shortage of training data. However, since they 
capture the structural difference in math expressions before 
and after the step, they can cancel out some of the noise in 
erroneous steps, resulting in acceptable performance. 


Method    | Accuracy 
GRU+C     | 75.35 ± 1.41 
TE+C      | 78.71 ± 1.74 
CNN+C     | 67.23 ± 1.54 
TE+TransE | 69.15 ± 1.13 
TE+TransR | 73.21 ± 1.63 


Table 5. Feedback type classification accuracy for all 
methods on the BUG subset. Tree embedding-based encoding 
outperforms other encoding methods while the TransE and TransR 
frameworks do not reach similar performance levels due to 
the shortage of training data. 


5.5 Discussions 

Overall, we find that the GRU-based and tree embedding- 
based math expression encoders in conjunction with a classi- 
fier perform almost equally well in most situations, while the 
CNN-based encoder performs worse. The tree embedding- 
based encoder has stronger generalizability across different 
data distributions. We believe that as the math expressions 
and operations get more complicated, methods that lever- 
age the tree structure of math expressions would be more 
advantageous. We also observe that TransR outperforms 
TransE most of the time, although in some experiments 
using TransE and TransR to explicitly learn math operation 
embeddings leads to slightly worse performance than 
classifiers using implicit representations of math expressions. 
However, TransE and TransR are much more powerful and 
enable us to study more tasks such as clustering solution 
steps, identifying typical student errors, and learning 
solution strategies; see Section 6 for a detailed discussion. 


6. CONCLUSIONS AND FUTURE WORK 


In this paper, we developed a series of methods to learn 
representations of math operations by observing how math 
expressions change as a result of these operations in 
step-by-step solutions to open-ended math questions. Our methods 
leverage math expression encoding methods that map tree- 
structured math expressions into a math embedding vector 
space. We demonstrated the effectiveness of our methods 
on a dataset containing detailed student solution steps to 
equation solving questions in the Cognitive Tutor system on 
two tasks: i) classifying the math operation applied in each 
step and ii) classifying the feedback the system deploys for 
each incorrect step. Results show that our learned math 
operation representations are meaningful and can often ef- 
fectively generalize across different data distributions such 
as questions with different difficulty levels. 


However, the success of our methods heavily depends on the 
availability of diverse large-scale training data. The Cogni- 
tive Tutor dataset that we used in this work represents a 
heavily restricted solution process since the list of math op- 
erations a student can apply in a step is pre-defined. There- 
fore, additional work has to be done to extend our method 
to truly open-ended step-by-step solution processes that are 
less structured. Moreover, our methods are restricted to a 
single solution step only and do not consider the relation- 
ship across multiple steps, which is related to another im- 
portant aspect of solving open-ended math questions: the 
overall solution strategy, i.e., which math operation to apply 
next. Furthermore, in both classification tasks, using tree 
embeddings to encode math expressions in conjunction with 
a classifier outperforms explicitly learning vectorized repre- 
sentations of math operations in the TransE and TransR 
frameworks. However, these explicit representations may 
enable us to perform other tasks. Nevertheless, our 
work provides a series of tools to analyze the math 
expressions students write down in their solutions by bridging the 
gap between symbolic math representations and continuous 
representations in vector spaces, enabling the use of 
state-of-the-art neural network-based methods. We believe that 
this work can potentially open up a new line of research that 
studies how to automatically analyze student solutions for 
grading and feedback purposes. 


There are many avenues of future work. First, since most 
real-world open-ended solutions contain a mixture of math 
expressions and text, there is a need to learn a joint represen- 
tation of math expressions and text in a shared embedding 
space. Second, this joint representation will enable us to 
train automated feedback generation methods in an end-to- 
end manner, using sequence-to-sequence learning methods 
[41]. Third, using learned math expression representations 
as the states and learned math operation representations 
from the TransE and TransR frameworks as the state transi- 
tion model, we can apply reinforcement learning and inverse 
reinforcement learning methods to learn solution strategies, 
i.e., which math operation to apply in the next step. We can 
also study solution strategies employed by real students [33] 
and diagnose their errors and design corresponding feedback 
mechanisms to improve their learning outcomes. These fu- 
ture work directions will enable us to tap into the full poten- 
tial of explicit math operation representations, which is not 
fully demonstrated in this paper: on the CogTutor dataset, 
the only relevant real-world dataset we found, we could only 
evaluate these explicit representations on the math 
operation and feedback prediction tasks, where they may not 
outperform tree embedding-based classification methods. 